 Imaging


Learning to Zoom with Anatomical Relations for Medical Structure Detection

Neural Information Processing Systems

Accurate anatomical structure detection is a critical preliminary step for diagnosing diseases characterized by structural abnormalities. In clinical practice, medical experts frequently adjust the zoom level of medical images to obtain comprehensive views for diagnosis.


GauSAM: Contour-Guided 2D Gaussian Fields for Multi-Scale Medical Image Segmentation with Segment Anything

Neural Information Processing Systems

Effective multiscale medical image segmentation requires simultaneously preserving smooth spatial continuity and accurately delineating high-frequency boundaries, yet pixel-wise decoders often fail to maintain this balance consistently across varying resolutions. We introduce GauSAM, which seamlessly integrates contour-guided 2D Gaussian probability fields into the Segment Anything Model to address these challenges. In our framework, segmentation masks are parameterized as continuous probability fields of learnable 2D Gaussian primitives, enforcing spatial smoothness and structural consistency. Contourlet transforms extract rich multidirectional frequency information, notably edges and fine textures, which dynamically guide the spatial distribution of Gaussian primitives to substantially improve boundary fidelity in complex structures.
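
To make the idea of a mask built from 2D Gaussian primitives concrete, here is a minimal sketch in plain Python. It is not GauSAM's implementation: the primitive layout `(cx, cy, sx, sy, theta, weight)`, the function name `gaussian_field`, and the noisy-OR rule used to combine overlapping primitives are all assumptions made for illustration, and the contourlet guidance is omitted entirely.

```python
import math

def gaussian_field(primitives, height, width):
    """Render a soft mask from anisotropic 2D Gaussian primitives.

    Each primitive is (cx, cy, sx, sy, theta, weight): centre, per-axis
    standard deviations, rotation in radians, and peak opacity in [0, 1].
    Overlaps are combined with a noisy-OR (an illustrative assumption).
    """
    field = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            keep = 1.0  # probability that no primitive covers this pixel
            for cx, cy, sx, sy, theta, w in primitives:
                dx, dy = x - cx, y - cy
                # Rotate the offset into the primitive's local frame.
                u = math.cos(theta) * dx + math.sin(theta) * dy
                v = -math.sin(theta) * dx + math.cos(theta) * dy
                g = w * math.exp(-0.5 * ((u / sx) ** 2 + (v / sy) ** 2))
                keep *= 1.0 - g
            field[y][x] = 1.0 - keep
    return field

# One elongated primitive centred in a 16x16 grid (hypothetical values).
mask = gaussian_field([(8.0, 8.0, 3.0, 1.5, 0.0, 0.9)], 16, 16)
```

Because every pixel value is a smooth function of the primitive parameters, the field varies continuously in space, which is the property the abstract contrasts with pixel-wise decoders.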


Causally Reliable Concept Bottleneck Models

Neural Information Processing Systems

Concept-based models are an emerging paradigm in deep learning that constrains the inference process to operate through human-interpretable variables, facilitating explainability and human interaction. However, these architectures, on par with popular opaque neural models, fail to account for the true causal mechanisms underlying the target phenomena represented in the data. This hampers their ability to support causal reasoning tasks, limits out-of-distribution generalization, and hinders the implementation of fairness constraints. To overcome these issues, we propose Causally reliable Concept Bottleneck Models (C2BMs), a class of concept-based architectures that enforce reasoning through a bottleneck of concepts structured according to a model of the real-world causal mechanisms. We also introduce a pipeline to automatically learn this structure from observational data and unstructured background knowledge (e.g., scientific literature). Experimental evidence suggests that C2BMs are more interpretable, causally reliable, and improve responsiveness to interventions w.r.t.


SHF: Symmetrical Hierarchical Forest with Pretrained Vision Transformer Encoder for High-Resolution Medical Segmentation

Neural Information Processing Systems

This paper presents a novel approach to addressing the long-sequence problem in high-resolution medical images for Vision Transformers (ViTs). Using smaller patches as tokens can enhance ViT performance, but quadratically increases computation and memory requirements. Therefore, the common practice for applying ViTs to high-resolution images is either to: (a) employ complex sub-quadratic attention schemes or (b) use large to medium-sized patches and rely on additional mechanisms within the model to capture the spatial hierarchy of details. We propose Symmetrical Hierarchical Forest (SHF), a lightweight approach that adaptively patches the input image to increase token information density and encode hierarchical spatial structures into the input embedding. We then apply a reverse depatching scheme to the output embeddings of the transformer encoder, eliminating the need for convolution-based decoders. Unlike previous methods that modify attention mechanisms or use a complex hierarchy of interacting models, SHF can be retrofitted to any ViT model to allow it to learn the hierarchical structure of details in high-resolution images without requiring architectural changes. Experimental results demonstrate significant gains in computational efficiency and performance: on the PAIP WSI dataset, we achieved a 3× to 32× speedup or a 2.95% to 7.03% increase in accuracy (measured by Dice score) at a 64K² resolution with the same computational budget, compared to state-of-the-art production models. On the 3D medical datasets BTCV and KiTS, training was 6× faster, with accuracy gains of 6.93% and 5.9%, respectively, compared to models without SHF.
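
The cost of shrinking patches can be checked with a few lines of arithmetic. The sketch below assumes uniform, non-overlapping square patches on a square image and dense self-attention; the helper names `num_tokens` and `attention_pairs` are hypothetical, and nothing here models SHF's adaptive patching.

```python
def num_tokens(image_side: int, patch_side: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_side % patch_side != 0:
        raise ValueError("patch_side must divide image_side")
    per_axis = image_side // patch_side
    return per_axis * per_axis

def attention_pairs(tokens: int) -> int:
    """Token pairs scored by dense self-attention (one head, one layer)."""
    return tokens * tokens

# Halving the patch side quadruples the tokens and multiplies the
# number of attention pairs by sixteen.
for patch in (32, 16, 8):
    n = num_tokens(1024, patch)
    print(patch, n, attention_pairs(n))
```

This is why uniform fine patching becomes impractical at very high resolutions, and why a scheme that spends small patches only where detail is present can reduce sequence length.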


Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum

Neural Information Processing Systems

Self-Supervised Learning (SSL) has become a powerful solution to extract rich representations from unlabeled data. Yet, SSL research is mostly focused on clean, curated and high-quality datasets. As a result, applying SSL on noisy data remains a challenge, despite being crucial to applications such as astrophysics, medical imaging, geophysics or finance. In this work, we present a fully self-supervised framework that enables noise-robust representation learning without requiring a denoiser at inference or downstream fine-tuning. Our method first trains an SSL denoiser on noisy data, then uses it to construct a denoised-to-noisy data curriculum (i.e., training first on denoised, then noisy samples) for pretraining an SSL backbone (e.g., DINOv2), combined with a teacher-guided regularization that anchors noisy embeddings to their denoised counterparts. This process encourages the model to internalize noise robustness. Notably, the denoiser can be discarded after pretraining, simplifying deployment. On ImageNet-1k with ViT-B under extreme Gaussian noise (σ = 255, SNR = 0.72 dB), our method improves linear probing accuracy by 4.8% over DINOv2, demonstrating that denoiser-free robustness can emerge from noise-aware pretraining.
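
The three ingredients named in the abstract (a noise level expressed in dB, a denoised-to-noisy schedule, and an anchoring term) can be sketched as small helpers. Everything below is an assumption for illustration: the power-ratio SNR definition is a common convention and is not claimed to reproduce the reported 0.72 dB, the hard switch at `switch_epoch` is one possible schedule, and the squared-distance penalty stands in for the paper's teacher-guided regularization, whose exact form the abstract does not give.

```python
import math

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels (mean-power ratio convention)."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(p_signal / p_noise)

def curriculum_source(epoch, switch_epoch):
    """Pick the data stream for an epoch: denoised first, then noisy."""
    return "denoised" if epoch < switch_epoch else "noisy"

def anchor_penalty(noisy_emb, denoised_emb):
    """Mean squared distance between a noisy embedding and its denoised anchor."""
    return sum((a - b) ** 2 for a, b in zip(noisy_emb, denoised_emb)) / len(noisy_emb)

# Hypothetical 6-epoch run that switches streams halfway through.
schedule = [curriculum_source(e, switch_epoch=3) for e in range(6)]
```

Since the denoiser is only needed to build the first stream and the anchors, it plays no role once pretraining ends, which matches the deployment claim in the abstract.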


The Boundaries of Fair AI in Medical Image Prognosis: A Causal Perspective

Neural Information Processing Systems

As machine learning (ML) algorithms are increasingly used in medical image analysis, concerns have emerged about their potential biases against certain social groups. Although many approaches have been proposed to ensure the fairness of ML models, most existing works focus only on medical image diagnosis tasks, such as image classification and segmentation, and overlook prognosis scenarios, which involve predicting the likely outcome or progression of a medical condition over time. To address this gap, we introduce FairTTE, the first comprehensive framework for assessing fairness in time-to-event (TTE) prediction in medical imaging. FairTTE encompasses a diverse range of imaging modalities and TTE outcomes, integrating cutting-edge TTE prediction and fairness algorithms to enable systematic and fine-grained analysis of fairness in medical image prognosis. Leveraging causal analysis techniques, FairTTE uncovers and quantifies distinct sources of bias embedded within medical imaging datasets. Our large-scale evaluation reveals that bias is pervasive across different imaging modalities and that current fairness methods offer limited mitigation. We further demonstrate a strong association between underlying bias sources and model disparities, emphasizing the need for holistic approaches that target all forms of bias. Notably, we find that fairness becomes increasingly difficult to maintain under distribution shifts, underscoring the limitations of existing solutions and the pressing need for more robust, equitable prognostic models.
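
One way a disparity in time-to-event prediction could be quantified is to compute a ranking metric separately per group and compare. The sketch below uses Harrell's concordance index, a standard survival-analysis metric; the abstract does not say which metrics FairTTE uses, so this choice, the function names, and the max-minus-min gap are assumptions for illustration only.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when subject i had an observed event
    strictly before time j. Higher risk should mean an earlier event.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

def group_gap(times, events, risks, groups):
    """Per-group C-index and the spread between best and worst group."""
    scores = {}
    for g in set(groups):
        idx = [k for k, v in enumerate(groups) if v == g]
        scores[g] = concordance_index(
            [times[k] for k in idx],
            [events[k] for k in idx],
            [risks[k] for k in idx],
        )
    return max(scores.values()) - min(scores.values()), scores

# Toy data (made up): risks are perfectly ordered for group "a"
# and perfectly reversed for group "b".
gap, per_group = group_gap(
    times=[1, 2, 3, 1, 2, 3],
    events=[1, 1, 1, 1, 1, 1],
    risks=[3, 2, 1, 1, 2, 3],
    groups=["a", "a", "a", "b", "b", "b"],
)
```

A gap near zero means the model ranks patients about equally well in every group; a large gap flags a group for which the prognostic ordering is less reliable.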


caSub Pair xt .

Neural Information Processing Systems

Omit references to the index or number of the sub-images, such as (xx), left, right, etc.
3. There might be a common prefix or suffix caption shared among all sub-images at the beginning, end, or within the caption. Please incorporate the prefix or suffix into each sub-image's caption. If one subcaption contains context for multiple other subcaptions, add that context to each of the relevant subcaptions.
4. The final output should be in JSON format, with an outer field 'subcaptions', with a value that is a list of 'subfigure' and 'subcaption' dictionaries.
5. If a subfigure contains more nested figures, i.e. subfigure (A) contains references to (left) and (right), add a field called "location" that stores the "left" or "right".
6. If there are no references to sub-images, give a single subcaption with label "A".
User Prompt: You are a research paper processor which splits the captions of figures into sub-captions that correspond with subfigures.
System Prompt: "(a) H&E image of a breast tumor tissue. Fluorescently labeled markers superimposed as green color on the H&E image, (b) \u03b2-catenin, (c) pan-keratin, and (d) smooth muscle \u03b1-actin, markers.":{"subcaptions":


Connecting Medical Vision

Neural Information Processing Systems

Multi-modal models are data hungry. While datasets with natural images are abundant, medical image datasets cannot afford the same luxury. To enable representation learning for medical images at scale, we turn to YouTube, a platform with a large reservoir of open-source medical pedagogical videos. We curate MedicalNarratives, a dataset of 4.7M medical image-text pairs, with 1M samples containing dense annotations in the form of spatial traces (and bounding boxes), and 118K videos centered on the trace event (with aligned text), enabling spatiotemporal grounding beyond single frames. Similar to think-aloud studies where instructors speak while hovering their mouse cursor over relevant image regions, 1M images in MedicalNarratives contain localized mouse traces in image pixels, creating a spatial and temporal association between the text and pixels. To evaluate the utility of MedicalNarratives, we train GENMEDCLIP with a CLIP-like objective using our dataset spanning 12 medical domains. GENMEDCLIP outperforms previous state-of-the-art models on all 12 domains on a newly constructed medical imaging benchmark.


Toward Artificial Palpation: Representation Learning of Touch on Soft Bodies

Neural Information Processing Systems

Palpation, the use of touch in medical examination, is almost exclusively performed by humans. We investigate a proof of concept for an artificial palpation method based on self-supervised learning. Our key idea is that an encoder-decoder framework can learn a representation from a sequence of tactile measurements that contains all the relevant information about the palpated object. We conjecture that such a representation can be used for downstream tasks such as tactile imaging and change detection. With enough training data, it should capture intricate patterns in the tactile measurements that go beyond a simple map of forces, which is the current state of the art. To validate our approach, we both develop a simulation environment and collect a real-world dataset of soft objects and corresponding ground truth images obtained by magnetic resonance imaging (MRI). We collect palpation sequences using a robot equipped with a tactile sensor, and train a model that predicts sensory readings at different positions on the object. We investigate the representation learned in this process, and demonstrate its use in imaging and change detection.


Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models

Neural Information Processing Systems

Clinical reasoning in medicine is a hypothesis-driven process where physicians refine diagnoses from limited information through targeted history, physical examination, and diagnostic investigations. In contrast, current medical benchmarks for large language models (LLMs) primarily assess knowledge recall through single-turn questions, where complete clinical information is provided upfront. To address this gap, we introduce VivaBench, a multi-turn benchmark that evaluates sequential clinical reasoning in LLM agents. Our dataset comprises 1152 physician-curated clinical vignettes structured as interactive scenarios that simulate a viva voce examination in medical training, requiring agents to actively probe for relevant findings, select appropriate investigations, and synthesize information across multiple steps to reach a diagnosis. We evaluated several state-of-the-art LLMs and found that while models demonstrate competence in diagnosing conditions within well-described clinical presentations, their performance degrades significantly when required to navigate diagnostic uncertainty. Our analysis identified several failure modes that mirror common issues in clinical practice, including: (1) fixation on initial hypotheses, (2) excessive investigation ordering, (3) premature diagnostic closure, and (4) missing critical conditions. These patterns reveal fundamental limitations in how current LLMs manage uncertainty and gather information sequentially. Through VivaBench, we provide a standardized benchmark for evaluating conversational medical AI systems for real-world clinical decision support. Beyond medical applications, we contribute to the larger corpus of research on agentic AI by demonstrating how sequential reasoning trajectories can diverge in complex decision-making environments.